RenderFlow

Single-Step Neural Rendering via Flow Matching

Zhang et al.
Presented by Manish Mathai

March 5, 2026

What is Rendering?

Example scenes: Cornell Box, Stanford Dragon, RenderFlow dataset

The process of turning a 3D scene into a 2D image

A scene consists of three components:
  • Geometry: surfaces made of millions of tiny triangles
  • Materials: color, roughness, metallic – how surfaces look
  • Lights: environment maps, point lights, the sun

Path Tracing

  • The gold standard for realism
  • Trace rays of light as they bounce around the scene
  • Each bounce picks a random direction, so thousands of rays are needed to converge
  • Problem: very expensive
    • Convergence can take minutes to hours per image
    • Need many samples (light rays) per pixel to reduce noise

Samples vs. Quality

1 SPP (top-left) to 32,768 SPP (bottom-right), doubling left to right.

What if we could skip the expensive sampling entirely?

Rasterization

  • Project triangles to pixels instead of tracing rays
  • First pass: Store scene properties as images called G-buffers:
    • Albedo (base color), Normals, Depth, Roughness, Metallic
  • Second pass: Combines these for fast shading
    • No global effects (soft shadows, reflections, indirect light); these are faked with cheaper, rougher approximations
  • This pipeline is called deferred rendering

The Gap

Path Tracing

  • Physically accurate
  • Minutes to hours per frame
  • The “ground truth”

Deferred Rendering

  • Real-time (milliseconds)
  • Misses global illumination
  • G-buffers are cheap to produce

Path Tracing vs Rasterization

The Quality Gap: Interiors

Rasterization (left) vs path tracing (right)

The Quality Gap: Subtle Details

Best of both worlds?




Can we get path-tracing quality from G-buffers… using a neural network?

Prior Work: Diffusion Models

  • Models like Stable Diffusion, DALL-E learn to generate images from noise
  • Forward process: gradually add noise to an image until it’s pure static
  • Reverse process: a neural network learns to undo the noise, step by step
  • Typically needs 20-50 denoising steps to produce a clean image
  • What if we condition the reverse process on G-buffers to guide it toward a rendered image?

RGB-X (SIGGRAPH 2024)

  • Condition a diffusion model on G-buffers to synthesize realistic images
  • Estimates intrinsic channels: the surface properties like albedo, normals, etc.
  • Also works in reverse: RGB image -> G-buffer decomposition
  • ~50 denoising steps, ~2.2 seconds per frame

DiffusionRenderer (CVPR 2025)

  • Extends the idea to video using a video diffusion model
  • Handles temporal consistency across frames
  • Trained on synthetic + auto-labeled real-world data
  • Enables relighting, material editing, and object insertion from a single video
  • ~30 denoising steps, ~1.4 seconds per frame

But Two Problems…

  • Slow: 20-50 denoising steps per frame
    • RGB-X: ~2.2 seconds per frame
    • DiffusionRenderer: ~1.4 seconds per frame
    • Not real-time (need < 33ms for 30fps)
  • Stochastic: different random seeds produce different results
    • Flickering between frames: shadows appearing and disappearing, lighting and reflections shifting
    • Not reproducible, making it bad for production pipelines

Flow Matching

  • Alternative to diffusion: learn a velocity field that transports samples from source to target distribution
  • Deterministic: follows a deterministic ODE instead of a stochastic process
    • The same input gives the same output every time
  • Rectified flow: encourages straight-line trajectories between paired samples
    • Straight paths incur zero discretization error – can be solved in as few as 1 Euler step
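
The zero-discretization-error claim can be illustrated with a toy Euler integrator (a minimal numpy sketch; the constant velocity field below is illustrative, not the paper's learned model):

```python
import numpy as np

def euler_integrate(velocity_fn, x0, n_steps):
    """Solve dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# A rectified (straight-line) flow has constant velocity x1 - x0,
# so a single Euler step already lands exactly on the target:
x0, x1 = np.zeros(4), np.full(4, 3.0)
v = lambda x, t: x1 - x0
assert np.allclose(euler_integrate(v, x0, 1), euler_integrate(v, x0, 50))
```

For a curved trajectory the two results would differ; straightness is exactly what makes one step sufficient.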

RenderFlow: Key Idea

  • Learn a single-step flow: G-buffers -> rendered image
  • Key insight: replace noise with albedo as the starting point
    • Already spatially aligned with the target. It has the right colors, textures, structure
    • Network only learns the residual: shadows, reflections, global illumination
    • Much smaller “distance” than noise -> image. A single step suffices
  • Single forward pass: ~0.19s per frame (10x faster than diffusion methods)
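
A minimal sketch of the inference idea: start from the albedo and take one Euler step along a learned velocity field. Here `velocity_net` is a deterministic stub standing in for the paper's transformer; all names are hypothetical:

```python
import numpy as np

def velocity_net(albedo, gbuffers):
    """Stub for the learned velocity field (the real model is a DiT).
    Deterministic function of its inputs, like the trained network."""
    return 0.1 * gbuffers - 0.05 * albedo

def render_single_step(albedo, gbuffers):
    # One Euler step over t in [0, 1]: image = albedo + 1.0 * v(albedo, cond)
    return albedo + velocity_net(albedo, gbuffers)

albedo = np.full((8, 8, 3), 0.5)
gbufs = np.full((8, 8, 3), 0.2)
a = render_single_step(albedo, gbufs)
b = render_single_step(albedo, gbufs)
assert np.array_equal(a, b)  # deterministic: same input, same output
```

Because the starting point already matches the target spatially, the velocity only has to encode the shading residual rather than reconstruct the whole image from noise.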

How to Train It: Bridge Matching

  • Pure flow matching trains on exact straight-line paths. That can be brittle
  • Bridge matching: add small noise perturbations during training only
    • \(z_t = (1-t)z_0 + tz_1 + \sigma\sqrt{t(1-t)}\epsilon\)
    • \(z_0\) = albedo, \(z_1\) = rendered image, \(\sigma\) = noise scale, \(\epsilon\) = random noise
    • Acts as a regularizer and the model sees diverse variations of the path
    • When \(\sigma = 0\), reduces to pure flow matching
  • Result: more robust to variations in lighting and materials
  • Inference remains deterministic – noise is a training trick only
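
The interpolation above can be written down directly (a sketch; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_sample(z0, z1, t, sigma):
    """z_t = (1-t) z0 + t z1 + sigma * sqrt(t(1-t)) * eps  (training-time only)."""
    eps = rng.standard_normal(z0.shape)
    return (1 - t) * z0 + t * z1 + sigma * np.sqrt(t * (1 - t)) * eps

z0, z1 = np.zeros(5), np.ones(5)
# The noise term vanishes at both endpoints and when sigma = 0:
assert np.allclose(bridge_sample(z0, z1, 0.0, 0.5), z0)
assert np.allclose(bridge_sample(z0, z1, 1.0, 0.5), z1)
assert np.allclose(bridge_sample(z0, z1, 0.5, 0.0), 0.5 * (z0 + z1))
```

The \(\sqrt{t(1-t)}\) factor concentrates the perturbation mid-trajectory, so the endpoints (albedo and rendered image) are never corrupted.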

Train Multi-Step, Infer Single-Step

  • Train with bridge matching at 4 discrete timesteps [1.0, 0.75, 0.5, 0.25], \(\sigma = 0.005\)
  • But infer in 1 step. Just one forward pass
  • Why does this work?
    • Multi-step training exposes the model to intermediate states
    • Single-step inference avoids error accumulation across steps

| Training   | Inference | PSNR  |
|------------|-----------|-------|
| 4-step ODE | 4 steps   | 23.09 |
| 4-step ODE | 1 step    | 23.30 |
| 4-step SDE | 4 steps   | 23.38 |
| 4-step SDE | 1 step    | 23.59 |

1-step inference outperforms multi-step because fewer steps mean less error accumulation

PSNR (Peak Signal-to-Noise Ratio): higher = better. Ablation at 256x256.

Architecture

  • Repurposes a pretrained video diffusion model (Wan2.1, 1.3B parameter DiT)
    • Albedo replaces noise input; text cross-attention removed
  • All inputs (G-buffers + environment map) encoded by VAE into latent space
  • G-buffer tokens added element-wise to albedo tokens (spatially aligned: they share pixel locations)
  • Envmap Adapter: environment map injected via adaptive normalization (scale + shift), since it is not spatially aligned with the G-buffers
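
"Adaptive normalization" here means modulating normalized tokens with a per-channel scale and shift predicted from the environment-map embedding. A rough numpy sketch, with assumed shapes and projection weights (the real adapter sits inside the transformer blocks):

```python
import numpy as np

def envmap_adaln(tokens, env_emb, w_scale, w_shift):
    """Normalize each token, then modulate it with scale/shift derived
    from a global (non-spatial) environment-map embedding."""
    mu = tokens.mean(axis=-1, keepdims=True)
    sd = tokens.std(axis=-1, keepdims=True) + 1e-6
    normed = (tokens - mu) / sd
    scale = env_emb @ w_scale   # per-channel scale, shape (C,)
    shift = env_emb @ w_shift   # per-channel shift, shape (C,)
    return normed * (1.0 + scale) + shift

rng = np.random.default_rng(1)
tokens = rng.standard_normal((16, 64))   # 16 tokens, 64 channels
env_emb = rng.standard_normal(32)        # global embedding
w_scale = np.zeros((32, 64))
w_shift = np.zeros((32, 64))
# With zero-initialized projections the adapter reduces to a plain layernorm:
out = envmap_adaln(tokens, env_emb, w_scale, w_shift)
assert out.shape == tokens.shape
```

This is the standard trick for injecting a global condition into spatial tokens without requiring pixel alignment.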

Training Losses

  • Latent loss: bridge matching loss in VAE latent space (the core objective)
  • Pixel losses applied after decoding back to image space:
    • LPIPS: perceptual similarity (captures structural differences humans notice)
    • Gradient loss: preserves high-frequency details like contact shadows
  • Total: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{latent}} + \lambda\, \mathcal{L}_{\text{pixel}}\)
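
A gradient loss is typically an L1 match on finite-difference image gradients; a sketch of one common form (the paper's exact definition may differ):

```python
import numpy as np

def gradient_loss(pred, target):
    """L1 distance between horizontal/vertical finite-difference gradients.
    Penalizes blurred edges (e.g. lost contact shadows), not flat offsets."""
    dx = lambda im: im[:, 1:] - im[:, :-1]
    dy = lambda im: im[1:, :] - im[:-1, :]
    return (np.abs(dx(pred) - dx(target)).mean()
            + np.abs(dy(pred) - dy(target)).mean())

img = np.arange(16.0).reshape(4, 4)
assert gradient_loss(img, img) == 0.0
assert gradient_loss(img + 1.0, img) == 0.0  # constant shift leaves gradients unchanged
```

Because a uniform brightness offset has zero gradient penalty, this term specifically targets high-frequency structure rather than overall exposure.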

Video Inference

  • Model trained on short clips (5 frames) for memory efficiency
  • Long videos rendered in overlapping chunks:
    • Last frame of chunk N becomes the conditioning frame for chunk N+1
    • Promotes smooth transitions and temporal coherence
  • Combined with keyframe guidance for best results
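
The chunked scheme can be sketched as follows, where `render_chunk` stands in for a model forward pass (names are mine, not the paper's API):

```python
def render_video(gbuffer_frames, render_chunk, chunk_len=5):
    """Render a long sequence chunk by chunk; the last rendered frame of
    each chunk conditions the next chunk for temporal coherence."""
    out, cond = [], None
    for i in range(0, len(gbuffer_frames), chunk_len):
        rendered = render_chunk(gbuffer_frames[i:i + chunk_len], cond)
        out.extend(rendered)
        cond = rendered[-1]
    return out

def stub(chunk, cond):
    # Tiny stand-in: record which previous frame (if any) conditioned this chunk.
    prev = -1 if cond is None else cond[0]
    return [(g, prev) for g in chunk]

frames = render_video(list(range(12)), stub, chunk_len=5)
assert len(frames) == 12
assert frames[0] == (0, -1)   # first chunk: no conditioning frame
assert frames[5] == (5, 4)    # second chunk conditioned on frame 4's output
```

The key property is that conditioning flows through rendered outputs, not G-buffers, so shading decisions made in one chunk carry into the next.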

Keyframe Guidance

  • G-buffers alone lack global lighting info (shadows, reflections are ambiguous)
  • Solution: feed sparse path-traced keyframes as additional guidance
    • e.g., one high-quality reference frame every 16 frames
  • Keyframe Adapter: cross-attention branch injected into each transformer block
    • Uses RoPE to encode temporal distance between keyframe and current frame
  • Two-stage training: train base model first, freeze it, then train only the adapter
    • Base performance unchanged when no keyframes are provided

Keyframe Guidance: Impact

| Keyframe Gap    | PSNR  |
|-----------------|-------|
| No keyframes    | 24.02 |
| Every 49 frames | 25.92 |
| Every 25 frames | 26.57 |
| Every 13 frames | 29.72 |

  • Even sparse keyframes (every 49 frames) significantly outperform no guidance
  • More keyframes = better quality, as expected
  • Negligible speed impact (~0.24s vs ~0.19s per frame)

Ablation at 256x256 (Supplementary Table S1); full-resolution results in Table 1

Inverse Rendering

  • Can we run the model backwards? RGB image -> G-buffers?
  • Freeze the entire forward model, add lightweight adapters:
    • LoRA on self-attention (same pattern as LLM fine-tuning)
    • Cross-attention conditioned on a text prompt (“albedo”, “normal”, etc.)
    • Per-intrinsic MLP heads for each output type
  • One unified model handles both forward and inverse rendering
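
The LoRA pattern mentioned above, in one line of linear algebra (a generic sketch of the technique, not the paper's exact configuration):

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, alpha=1.0):
    """y = x W + alpha * (x A) B: W stays frozen; only the low-rank
    factors A (d x r) and B (r x d_out) are trained."""
    return x @ W_frozen + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d, r, d_out = 64, 4, 64
x = rng.standard_normal((10, d))
W = rng.standard_normal((d, d_out))
A = rng.standard_normal((d, r)) * 0.01
B = np.zeros((r, d_out))  # standard init: the adapter starts as a no-op
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

Zero-initializing `B` means the adapted model exactly reproduces the frozen forward model at the start of fine-tuning, which is why the base rendering quality is preserved.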

Results: Quantitative

Traditional baseline (not a neural method):

| Method   | Paradigm    | Params | PSNR  | LPIPS | Time (s)  |
|----------|-------------|--------|-------|-------|-----------|
| Deferred | Traditional | -      | 24.65 | 0.097 | real-time |

Neural rendering methods:

| Method            | Paradigm  | Params | PSNR  | LPIPS | Time (s) |
|-------------------|-----------|--------|-------|-------|----------|
| RGB-X             | Diffusion | 950M   | 20.98 | 0.165 | ~2.19    |
| DiffusionRenderer | Diffusion | 1.7B   | 23.76 | 0.128 | ~1.40    |
| Ours (w/o key)    | Flow      | 1.4B   | 24.21 | 0.113 | ~0.19    |
| Ours (w/ key)     | Flow      | 1.7B   | 26.66 | 0.101 | ~0.24    |

LPIPS (Learned Perceptual Image Patch Similarity): lower = better perceptual quality

  • 10x faster than RGB-X, 7x faster than DiffusionRenderer
  • Outperforms both neural baselines on all metrics, even without keyframes
  • With keyframes: surpasses even traditional deferred rendering

Results: Deterministic

  • RenderFlow: zero variance across runs (deterministic)
  • Diffusion baselines: significant variance (stochastic)
  • Same input always produces the exact same output
  • Critical for production: no flickering, reproducible results

Results: Visual Comparison

Dataset

  • No existing large-scale rendering dataset with G-buffers + environment maps
  • Built a custom dataset using Unreal Engine 5 Movie Render Queue:
    • Artist-crafted: 30,000 frames from professional scenes
    • Procedural: 100,000 frames from randomly composed scenes
      • 4,000 unique meshes, 30 HDR environment maps
      • Randomized material attributes for diversity
  • All rendered at 512x512, 256 SPP, denoised with Intel Open Image Denoise
  • Both baselines (RGB-X, DiffusionRenderer) fine-tuned on this same dataset

Limitations

  • VAE bottleneck: encoder/decoder accounts for ~90% of inference time
    • The transformer itself is fast; the VAE is the constraint
  • Dataset diversity: synthetic scenes only, limited lighting phenomena and geometric complexity
    • Fails on highly complex geometries (fine-grained details lost in VAE compression)
  • Temporal blurring: causal VAE convolution causes later frames to blur
    • Initial frame stays sharp; subsequent frames progressively soften
  • Resolution: trained and evaluated at 512x512 only

Discussion

  • Key takeaway: flow matching + albedo starting point = single-step rendering
  • Three contributions:
    1. Single-step flow-based rendering (10x faster, deterministic)
    2. Keyframe guidance adapter (significant quality boost)
    3. Inverse rendering via frozen backbone + adapters
  • Open questions:
    • Can the VAE bottleneck be eliminated?
    • How does this scale to 1080p or 4K?
    • Could this work with real-world captured scenes (not just Unreal Engine 5)?
    • What about dynamic lighting changes within a sequence?

Thank You!

Questions and discussion